Method and apparatus for controlling the flow of data between servers using optimistic transmitter

ABSTRACT

Link-based flow control requires each link transmitter to retain packets until they are acknowledged by the link receiver. Depending on the type of acknowledgment, the transmitter then either retries or de-allocates the packets. To improve throughput, the present invention includes an optimistic transmitter, which transmits packets without knowing the state of the receiver buffer. By so doing, the present invention reduces the latency caused by delays in transit time between nodes. Furthermore, single acknowledgments are used to indicate successful receipt of multiple packets. A single negative acknowledgment indicates successful receipt of all data between the last acknowledged data packet and the packet associated with the negative acknowledgment, which itself was received with errors.

RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application No. 60/[Attorney Docket Number 42390.P4879Z], filed on Aug. 29, 1997, entitled “Method and Apparatus for Communicating Between Interconnected Computers, Storage Systems, and Other Input/Output Subsystems” by inventors Ahmet Houssein, Paul A. Grun, Kenneth R. Drottar, and David S. Dunning, and to U.S. Provisional Application No. 60/081,220, filed on Apr. 9, 1998, entitled “Next Generation Input/Output” by inventors Christopher Dodd, Ahmet Houssein, Paul A. Grun, Kenneth R. Drottar, and David S. Dunning. These applications are hereby incorporated by reference as if repeated herein in their entirety, including the drawings. Furthermore, this application is related to U.S. patent application Ser. No. [Attorney Docket Number 2207/4974] filed by David S. Dunning and Kenneth R. Drottar on even date herewith and entitled “Method and Apparatus for Controlling the Flow of Data Between Servers.”

BACKGROUND OF THE INVENTION

[0002] The present invention relates generally to methods and apparatuses for controlling the flow of data between two nodes (or two points) in a computer network, and more particularly to a method and apparatus for controlling the flow of data between two nodes (or two points) in a system area network.

[0003] For the purposes of this application, the term “node” will be used to describe either an origination point of a message or the termination point of a message. The term “point” will be used to refer to an intermediate point in a transmission between two nodes. The present invention includes communications between either a first node and a second node, a node and a switch, which is part of a link, between a first switch and a second switch, which comprise a link, and between a switch and a node.

[0004] An existing flow control protocol, known as Stop and Wait ARQ, transmits a data packet and then waits for an acknowledgment (ACK) from the termination node before transmitting the next packet. As data packets flow through the network from node to node, latency becomes a problem. Latency results from the large number of links in the fabric because each packet requires an acknowledgment of successful receipt from the receiving node before the next packet can be sent from the transmitting node. Consequently, there is an inherent delay resulting from the transit time for the acknowledgment to reach the transmitting node from the receiver.

[0005] One solution, which is known as Go Back n ARQ, uses sequentially numbered packets, in which a sequence number is sent in the header of the frame containing the packet. In this case, several successive packets are sent up to the limit of the receive buffer, but without waiting for the return of the acknowledgment. According to this protocol, the receiving node only accepts the packets in the correct order and sends request numbers (RN) back to the transmitting node along with the flow control information, such as the state of the receive buffer. The effect of a given request number is to acknowledge all packets prior to the requested packet and to request transmission of the packet associated with the request number. The go back number n is a parameter that determines how many successive packets can be sent from the transmitter in the absence of a request for a new packet. Specifically, the transmitting node is not allowed to send packet i+n before i has been acknowledged (i.e., before i+1 has been requested). Thus, if i is the most recently received request from the receiving node, there is a window of n packets that the transmitter is allowed to send before receiving the next acknowledgment. In this protocol, if there is an error, the entire window must be resent, as the receiver will only permit reception of the packets in order. Thus, even if the error lies near the end of the window, the entire window must be retransmitted. This protocol is most suitable for large-scale networks having high probabilities of error. In this protocol, the window size n is based on the size of the receive buffer. Thus, the transmitter does not send more data than the receiver can buffer. Consequently, at start-up, the two nodes must transmit information to each other regarding the size of their buffers, defaulting to the smaller of the two buffers during operation.
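By way of illustration only, the window rule of Go Back n ARQ described above can be sketched in Python as follows. The function names `can_send` and `packets_to_resend` are hypothetical and do not come from any protocol specification; sequence-number wraparound is ignored for simplicity.

```python
def can_send(seq: int, last_request: int, n: int) -> bool:
    """Return True if packet `seq` lies inside the current window.

    `last_request` is the most recently received request number i, so the
    window covers packets i, i+1, ..., i+n-1; packet i+n must wait.
    """
    return last_request <= seq < last_request + n


def packets_to_resend(last_request: int, highest_sent: int) -> list:
    """On an error, everything from the requested packet onward is resent,
    because the receiver only accepts packets in order."""
    return list(range(last_request, highest_sent + 1))


# Example: with i = 5 and n = 4 the transmitter may send packets 5..8.
window = [seq for seq in range(0, 12) if can_send(seq, last_request=5, n=4)]
```

In this sketch an error on any packet in the window forces all of packets 5 through 8 to be sent again, which is the inefficiency the paragraph above points out.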

[0006] In an architecture that permits large data packets, unnecessarily retransmitting excess packets can become a significant efficiency concern. For example, retransmitting an entire window of data packets, each on the order of 4 Gigabytes, would be relatively inefficient.

[0007] Other known flow control protocols require retransmission of only the packet received in error. This requires the receiver to maintain a buffer of the correctly received packets and to reorder them upon successful receipt of the retransmitted packet. While keeping the bandwidth requirements to a minimum, this protocol significantly complicates the receiver design as compared to that required by Go Back n ARQ.

[0008] The present invention is therefore directed to the problem of developing a method and apparatus for controlling the flow of data between nodes in a system area network that improves the efficiency of the communication without overly complicating the processing at the receiving end.

SUMMARY OF THE INVENTION

[0009] The present invention provides a method for transmitting data packets from a first endpoint to a second endpoint, either directly or via a fabric. The method of the present invention includes the steps of transmitting the data from a first node in a plurality of packets, and transmitting the data independently of a state of a receive buffer in the second node.

[0010] The present invention also provides an apparatus for communicating data between two endpoints coupled together either directly or via a fabric. The apparatus includes a first switch disposed in a first endpoint, a second switch and a buffer. The second switch can be disposed either in the fabric or in the second endpoint. The first switch transmits the data in a plurality of packets from the first endpoint to the second switch independently of a state of a receive buffer in the second switch. The buffer is located in the first endpoint, is coupled to the first switch, and stores each packet until receiving either an acknowledgment that the packet was successfully received or an error indication that a received version of the packet included at least one error.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] FIG. 1 depicts two nodes communicating directly, to which the method of the present invention is applicable.

[0012] FIG. 2 depicts one exemplary embodiment of the present invention, which includes two nodes communicating via a fabric or switch, to which the method of the present invention is applicable.

[0013] FIG. 3 depicts one exemplary embodiment of the present invention, which includes two nodes communicating via a series of links, to which the method of the present invention is applicable.

DETAILED DESCRIPTION

[0014] The present invention provides a simple technique for providing a working network with flow control mechanisms that do not allow for lost data due to congestion, or transient bit errors due to internal or external system noise. The present invention uses an approach to flow control that does not require end-to-end or link-to-link credits; rather, the present invention combines flow control with the ability to detect a corrupted or out-of-order packet and retry (resend) any or all packets, so as to ensure that all data is delivered uncorrupted, without loss, and in the order in which it was sent.

[0015] The present invention accomplishes this by assigning a sequence number to each packet, performing an error detection on each packet, such as calculating a 32-bit Cyclic Redundancy Check (CRC), and acknowledging (ACK) or negative acknowledging (NAK) each packet at each link within the fabric and not just at the endpoint. According to the prior art, all acknowledgments in a computer network occur at the endpoints, but not within each individual link, i.e., from one intermediate point to another intermediate point, or from one intermediate point to an endpoint, or from one endpoint to an intermediate point.

[0016] The present invention assumes a network built out of point-to-point links. The minimum sized network is two endpoints connected via one link, as depicted in FIG. 1. To simplify our discussion, we will assume a one-way transmission of packets, from endpoint A to endpoint B, except that endpoint B transmits either an ACK or a NAK back to endpoint A. For simplicity's sake, two nodes 1, 2 (A and B in FIG. 1) in the network will be used to describe the present invention, noting that the present invention holds for a network of unlimited size. The present invention assumes a send queue and a receive queue at each end of each link (however, for simplicity, the receive queue at the transmitting end (A) and the transmit queue at the receive end (B) are not shown). Thus, in FIG. 1 we show a node A 1 coupled to node B 2 via a link. Node A 1 has a send buffer 3 of length n, and node B 2 has a receive buffer 4 of length m.

[0017] For implementation of the present invention, the size of the send queue 3 need not match the size of the receive queue 4 (nor does the send queue of node B (not shown) need to match the size of the receive queue of node A (not shown)). In general, send queues will be larger than receive queues, simply because the system can recover from data lost in a receive queue by retransmitting the data from the send queue, but the opposite is not possible.

[0018] In this example, the size of the send queue on node A is defined as n, and the size of the receive queue on node B is defined as m. Node 1 is allowed to send up to n packets to the receive queue on node 2 because the sender only knows the size of its own queue, as there is no handshaking during power-up. Under congestion-free conditions, packets received at node 2 will be processed and immediately passed on. Node 2 must send back an ACK notifying node 1 that the packets have been received correctly by “ACKing” the sequence number. Note that, as an efficiency improvement to this algorithm, the receiver can ACK multiple packets at one time by ACKing the highest sequence number that has been correctly received, e.g., if the sender receives an ACK for packet #9, then receives an ACK for packet #14, packets #10-#13 are also implicitly ACKed.
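By way of illustration only, the implicit acknowledgment described above can be sketched in Python as follows. The class name `SendQueue` and its methods are hypothetical names chosen for this sketch, and sequence-number wraparound is ignored.

```python
from collections import OrderedDict


class SendQueue:
    """Holds transmitted packets until they are acknowledged (illustrative)."""

    def __init__(self, capacity: int):
        self.capacity = capacity          # n, the size of the send queue
        self.unacked = OrderedDict()      # sequence number -> payload

    def is_full(self) -> bool:
        return len(self.unacked) >= self.capacity

    def store(self, seq: int, payload: bytes) -> None:
        """Keep a copy of a transmitted packet so it can be retried."""
        if self.is_full():
            raise BufferError("send queue is full of unacknowledged packets")
        self.unacked[seq] = payload

    def ack(self, seq: int) -> list:
        """An ACK for `seq` also releases every earlier packet still held."""
        released = [s for s in self.unacked if s <= seq]
        for s in released:
            del self.unacked[s]
        return released


queue = SendQueue(capacity=8)
for seq in range(9, 15):
    queue.store(seq, b"payload")
first = queue.ack(9)      # releases packet 9 only
second = queue.ack(14)    # releases 10-13 implicitly, plus 14 itself
```

Here the single ACK for packet #14 frees the buffers for packets #10-#13 as well, matching the example in the paragraph above.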

[0019] In the event of a transient error due to internal or external system noise, data may be corrupted between the sending node (1) and the receiving node (2). The receiving node must calculate the CRC across the data received, and compare it to the CRC appended to the end of the packet. If the calculated CRC and the received CRC match, the packet will be ACKed. If the two CRCs do not match, that packet must be NAKed, again identified by the sequence number. Upon receipt of a NAK, the sender must resend the specified packet, followed by all packets following that packet. For example, if the sender has sent packets up to sequence number 16 but receives a NAK for packet #14, it must resend packet #14, followed by packet #15 and packet #16. Note that ACKs and NAKs can still be combined. Using the example above, if packet #9 is ACKed, and packets #10-#13 are received in order and without data corruption, followed by packet #14 with corrupted data, then a NAK of packet #14 signifies that packets #10-#13 were received without error, but that packet #14 was received with error and must be resent. Also note that the present invention does not force the network to operate in a store and forward fashion.
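By way of illustration only, the CRC comparison and the resulting ACK or NAK can be sketched in Python as follows. The packet layout (a 4-byte sequence number, the payload, then a 4-byte CRC) and the function names are assumptions made for this sketch; the description above does not fix a wire format. The sketch uses the CRC-32 provided by Python's `zlib` module, and it NAKs the expected sequence number on a CRC mismatch, since a corrupted header cannot be trusted.

```python
import struct
import zlib


def make_packet(seq: int, payload: bytes) -> bytes:
    """Build a packet: sequence number, payload, then CRC-32 appended at the end."""
    body = struct.pack(">I", seq) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def check_packet(packet: bytes, expected_seq: int) -> tuple:
    """Return ("ACK", seq) for a good in-order packet, else ("NAK", expected_seq)."""
    body = packet[:-4]
    received_crc = struct.unpack(">I", packet[-4:])[0]
    if zlib.crc32(body) != received_crc:
        return ("NAK", expected_seq)       # corrupted data
    seq = struct.unpack(">I", body[:4])[0]
    if seq != expected_seq:
        return ("NAK", expected_seq)       # out-of-order packet
    return ("ACK", seq)


def resend_from(nak_seq: int, highest_sent: int) -> list:
    """On a NAK, resend the named packet followed by all later packets."""
    return list(range(nak_seq, highest_sent + 1))


good = make_packet(14, b"data")
corrupted = bytearray(good)
corrupted[5] ^= 0xFF                        # flip bits in the payload
result_good = check_packet(good, expected_seq=14)
result_bad = check_packet(bytes(corrupted), expected_seq=14)
```

With packets sent up to sequence number 16, `resend_from(14, 16)` yields packets 14, 15 and 16, as in the example above.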

[0020] If congestion in the network occurs, received packets may not be able to immediately make progress through the switch/router. Therefore, when the local buffer space 4 is filled at the receiver B, additional packets will be lost, e.g., when receive buffer 4 fills up, packets that follow will be thrown away. However, given that retry can occur across each link, packets being thrown away is relatively simple to recover from. As soon as the receiver B 2 starts moving packets out of its receive buffer 4, it opens up room for additional packets to be received. The receiver B 2 will check the sequence number of the next packet it receives. In the event that the sender A 1 has sent packets that were dropped on the floor, the first dropped packet will be NAKed, and transmission will therefore resume from that packet on.
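By way of illustration only, the overflow behavior described above can be sketched in Python as follows. The class name `Receiver`, its methods, and the choice to drop silently while the buffer is full are assumptions made for this sketch; CRC checking is omitted to keep the focus on congestion.

```python
from collections import deque


class Receiver:
    """Receive buffer of length m that drops packets when full (illustrative)."""

    def __init__(self, capacity: int, first_seq: int = 0):
        self.capacity = capacity
        self.queue = deque()
        self.expected = first_seq          # next in-order sequence number

    def receive(self, seq: int, payload: bytes):
        if len(self.queue) >= self.capacity:
            return None                    # buffer full: packet is thrown away
        if seq != self.expected:
            return ("NAK", self.expected)  # NAK the first packet that was lost
        self.queue.append((seq, payload))
        self.expected += 1
        return ("ACK", seq)

    def drain(self, count: int = 1) -> None:
        """Move packets onward, opening room in the receive buffer."""
        for _ in range(min(count, len(self.queue))):
            self.queue.popleft()


rx = Receiver(capacity=2)
replies = [rx.receive(seq, b"p") for seq in range(4)]   # packets 2 and 3 are dropped
rx.drain(1)                                             # congestion clears
after_drain = rx.receive(4, b"p")                       # out of order: NAK for packet 2
retry = rx.receive(2, b"p")                             # retransmitted packet is accepted
```

In this sketch the sender learns of the loss only once space opens up and a later packet arrives out of order, at which point it would resend from packet 2 onward.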

[0021] According to the present invention, the sender A 1 just keeps sending packets until its send buffer 3 is full of packets that have not been ACKed. It must wait for an ACK for those packets before it can reuse those buffers 3 (it needs to be able to retry those packets if necessary). The present invention does not, however, as in the prior art, stop sending data when the receive queue 4 has filled. The present invention combines the flow control process (i.e., credits) with the error detection process (ACK-NAKing). By eliminating the need for credits to be transmitted, the present invention reduces the overhead of the flow control.
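By way of illustration only, the sender-side rule described above can be sketched in Python as follows. The class name `OptimisticSender` and its methods are hypothetical; the sketch tracks only sequence counters, which stand in for the occupancy of the send buffer, and it ignores sequence-number wraparound.

```python
class OptimisticSender:
    """Sends without regard to the receiver's buffer state (illustrative)."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # n, the size of the send buffer
        self.base = 0              # oldest packet not yet ACKed
        self.next_seq = 0          # next sequence number to transmit

    def can_send(self) -> bool:
        """The only limit is the sender's own buffer of unACKed packets."""
        return self.next_seq - self.base < self.capacity

    def send(self):
        if not self.can_send():
            return None            # send buffer full: wait for an ACK or NAK
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_ack(self, seq: int) -> None:
        """Release the ACKed packet and all earlier ones."""
        self.base = max(self.base, seq + 1)

    def on_nak(self, seq: int) -> None:
        """Packets before `seq` arrived; restart transmission at `seq`."""
        self.base = max(self.base, seq)
        self.next_seq = self.base


tx = OptimisticSender(capacity=4)
burst = [tx.send() for _ in range(5)]   # fifth attempt stalls on a full buffer
tx.on_ack(1)                            # frees two buffers
more = [tx.send(), tx.send(), tx.send()]
tx.on_nak(3)                            # receiver reports a problem with packet 3
restart = tx.send()
```

No credit count or receiver buffer size appears anywhere in this sketch; the transmitter stalls only when its own buffer is exhausted.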

[0022] The advantages of the present invention are at least twofold. First, the present invention allows for retry of corrupted packets due to bit error rates on a medium. Second, the present invention implements flow control between two endpoints 1, 2 which will yield better bandwidths for link efficiency than a traditional credit-based flow control. A credit-based scheme stops sending packets when all credits are used up, and transmission cannot resume until additional credits are received. By contrast, the present invention continues sending data until the receiver sends a NAK, at which time the transmitter restarts at the point at which the receiver indicated the NAK. In the prior art, the time to start and stop data transfer is dependent on the round trip time of the traversing link, a dependence which is eliminated in the present invention. The present invention is optimistic in that it sends packets with the expectation that they will be received correctly and is not dependent on the round trip time of the link. In other words, the transmitter of the present invention operates independently of the state of the receiver buffer.

[0023] The present invention is also less complex than implementing retry across a link as well as a credit based flow control mechanism. This scheme works regardless of the type of NIC, switch or router architecture used. The hardware necessary to implement these mechanisms is relatively simple, as depicted in FIG. 1.

[0024] The present invention can be used in networks between servers as well as across serial links used for I/O. FIG. 2 depicts two endpoints 1, 2 (or nodes) A and B coupled via a fabric X 5, which has a receive buffer 6. (For purposes of simplicity, buffer 6 will be used as a receive buffer when describing a transmission from A to X, and as a send buffer when describing a transmission from X to B). Fabric X can be a single switch, multiple switches, multiple links, etc. The important distinction is that X includes a send/receive buffer 6, which enables X to ACK/NAK data received from A, and to resend data to B without requiring the data to be resent from A upon receipt of a NAK from B.

[0025] FIG. 3 depicts yet another possible embodiment of the present invention. In this case, two intermediate points X and Y include send/receive buffers 6, 8. As discussed above, these buffers 6, 8 enable X and Y to ACK/NAK data between themselves without requiring the data to be resent from the endpoints A and B. If the link between X and Y is particularly noisy, such as a satellite link, then data can be resent from X to Y and from Y to X without notifying A and/or B. Endpoint A may continue sending data to X while X retries sending data to Y, even though X is throwing the data on the floor, as A will simply continue sending until filling up its send buffer. Once its send buffer is full, A will wait until receiving an ACK from X before deallocating packets from its buffer. Once A's buffer is full, A can turn to other tasks while X continues to retry data to Y.

[0026] Note that there is a strong trend to move to low voltage differential swing (LVDS) serial links as the most cost effective way to transfer data over zero to tens of meters (potentially kilometers if optical technology is used). The bit error rate of such links, combined with their data rate, yields an average occurrence rate of errors too high to use for communication between IA computers without retry and flow control. The present invention is applicable to LVDS bit serial data movement in a reliable environment, including disk adapters/controllers for connection to attached storage devices (NASD) and system area networks (SANs) for inter-process communication (IPC).

[0027] The present invention adds intelligence to the switches in the fabric, which heretofore has not existed. By enabling the switches in the fabric to include buffers and ACK/NAKing capability, the present invention significantly reduces the latency problems in large networks.

What is claimed is:
 1. A method for transmitting data packets from a first node to a second node, said method comprising the steps of: a) transmitting the data from a first node in a plurality of packets; and b) transmitting the data independently of a state of a receive buffer in the second node.
 2. A method for transmitting data packets from a first node to a second node via at least one switch, said method comprising the steps of: a) transmitting the data packets from the first node to the at least one switch independently of a state of a receive buffer in the at least one switch; b) transmitting the data packets, which were received from a first node, to the second node from the at least one switch independently of a state of a receive buffer in the second node.
 3. A method for transmitting data packets from a transmitting switch to a receiving switch in a fabric, wherein said data packets are being transmitted through said fabric between two endpoints, comprising the steps of: a) transmitting the data packets from the transmitting switch in the fabric to the receiving switch in the fabric; and b) transmitting the data from the transmitting switch independently of a state of a receive buffer in the receiving switch.
 4. The method according to claim 3, further comprising the step of: c) retaining each data packet in a buffer at a transmitting switch until receiving either an acknowledgment indicating that said each data packet was successfully received by the receiving switch or an error indication that a received version of said each data packet received at the second switch included at least one error, while simultaneously transmitting additional packets from the transmitting switch.
 5. The method according to claim 3, further comprising the step of transmitting at least one data packet from the transmitting switch after a receiver buffer in the receiving switch has filled.
 6. The method according to claim 3, further comprising the step of transmitting at least one data packet from the transmitting switch even though a receiver buffer in the receiving switch is in an overflow state.
 7. The method according to claim 4, further comprising the step of: d) indicating successful receipt of all data packets between a last acknowledged packet and a particular packet by sending a single acknowledgment from the receiving switch to the transmitting switch for all said data packets between the last acknowledged packet and a particular packet.
 8. The method according to claim 7, further comprising the steps of: e) de-allocating a particular packet in the buffer at the transmitting switch upon receipt of an acknowledgment associated with said particular packet from the receiving switch; and f) de-allocating any other packets in the buffer at the transmitting switch between said particular packet and a last acknowledged packet.
 9. The method according to claim 8, further comprising the steps of: g) retransmitting said each packet and all subsequent packets upon receipt of an error indication from the receiving switch; and h) dropping all received packets following said each packet associated with the error indication until successfully receiving a retransmitted version of said each packet from the transmitting switch.
 10. A method for transferring data across a fabric in a system area network including a plurality of links using a link to link protocol, said method comprising the steps of: a) transmitting the data in a plurality of packets from link to link; b) retaining each packet in a buffer at a transmitting link until receiving either an acknowledgment indicating that said each packet was successfully received at a receiving link or an error indication that a received version of said each packet received at the receiving link included at least one error, while simultaneously transmitting additional packets independently of a state of a receiver buffer in the receiving link; and c) using a single negative acknowledgment to indicate that a packet associated with the negative acknowledgment includes at least one error and to simultaneously indicate that all previous packets received at the receiving link prior to the packet associated with the negative acknowledgment were received correctly.
 11. The method according to claim 10, further comprising the step of indicating successful receipt of all packets between a last acknowledged packet and a particular packet by sending a single acknowledgment for said particular packet and said all packets between the last acknowledged packet and the particular packet.
 12. The method according to claim 10, further comprising the steps of: d) de-allocating a particular packet in the buffer at the transmitting link node upon receipt of an acknowledgment associated with said particular packet; and e) de-allocating any other packets in the buffer between said particular packet and a last acknowledged packet.
 13. An apparatus for communicating data between two endpoints comprising: a) a first switch being disposed in a first endpoint and transmitting the data packets in a plurality of packets from the first endpoint to a second endpoint independently of a state of a receive buffer in the second endpoint; and b) a buffer being disposed in the first endpoint, being coupled to the first switch and storing each packet until receiving either an acknowledgment that said each packet was successfully received or an error indication that a received version of said each packet included at least one error.
 14. The apparatus according to claim 13, further comprising: c) a second switch being disposed in the second node, receiving each of the plurality of data packets, and upon receipt of an error free packet sending an acknowledgment to indicate successful receipt of said error free packet and all previous error free packets received in sequence between a last acknowledged packet and said error free packet.
 15. The apparatus according to claim 13, wherein the first switch de-allocates a packet in the buffer upon receipt of an acknowledgment associated with said packet in the buffer in addition to all packets preceding said packet in the buffer.
 16. The apparatus according to claim 13, wherein the first switch retransmits a particular packet and all packets in sequence subsequent to the particular packet upon receipt of an error indication associated with said particular packet.
 17. The apparatus according to claim 14, wherein said second switch drops all received packets in sequence following a corrupted packet until successfully receiving a retransmitted version of said corrupted packet.
 18. An apparatus for communicating data between a two endpoints coupled together via a fabric comprising: a) a first switch being disposed in a first endpoint; b) a second switch being disposed in the fabric, wherein said first switch transmits the data packets in a plurality of packets from the first endpoint to the second switch independently of a state of a receive buffer in the second switch; c) a buffer being disposed in the first endpoint, being coupled to the first switch and storing each packet until receiving either an acknowledgment that said each packet was successfully received or an error indication that a received version of said each packet included at least one error.
 18. An apparatus for communicating data between a first endpoint and a second endpoint, said apparatus comprising: a) a first switch receiving data being transmitted from the first endpoint as a plurality of packets, and transmitting the plurality of packets; b) a second switch being coupled to the first switch, receiving the plurality of packets being transmitted by the first switch, being coupled to the second endpoint and transmitting the plurality of packets to the second endpoint; c) a first send buffer being coupled to the first switch, and storing each transmitted packet until receiving an acknowledgment that said each transmitted packet was successfully received by said second switch; d) a first receive buffer being coupled to the first switch and storing data packets being transmitted from the first endpoint; e) a second send buffer being coupled to the second switch, and storing each transmitted packet until receiving an acknowledgment that said each transmitted packet was successfully received by said second endpoint; and f) a second receive buffer being coupled to the second switch and storing data packets being transmitted from the first switch, wherein the first switch transmits data packets independently of a state of the second receive buffer.
 19. The apparatus according to claim 18, wherein the second switch transmits data packets independently of a state of a receive buffer in the second endpoint.
 20. The apparatus according to claim 18, wherein the first endpoint transmits data packets independently of a state of the first receive buffer.
 21. A program storage device readable by a machine, tangibly embodying a program of instructions executable by a machine to perform method steps for transmitting data between switches in a fabric, said method comprising the steps of: a) transmitting the data in a plurality of packets from switch to switch; b) retaining each packet in a buffer at a transmitting switch until receiving either an acknowledgment indicating that said each packet was successfully received or an error indication that a received version of said each packet included at least one error, while simultaneously transmitting additional packets independently of a state of a received buffer in a receiving switch.
 22. The device according to claim 21, wherein said method further comprises the step of: c) indicating successful receipt of all packets between a last acknowledged packet and a particular packet by sending either a single acknowledgment for said all packets between a last acknowledged packet and a particular packet or a single error indication for said all packets between a last acknowledged packet and a particular packet, which indicates successful receipt of said all packets and corrupted receipt of said particular packet.
 23. The device according to claim 22, wherein said method further comprises the steps of: d) de-allocating a particular packet in the buffer at the transmitting switch upon receipt of an acknowledgment associated with said particular packet; and e) de-allocating any other packets in the buffer between said particular packet and a last acknowledged packet.
 24. The device according to claim 23, further comprising the steps of: d) retransmitting said each packet and all subsequent packets upon receipt of an error indication; and e) dropping all received packets following said each packet associated with the error indication until successfully receiving a retransmitted version of said each packet.
 25. A method for transmitting data packets from a first node to a second node, said method comprising the steps of: a) transmitting the data from a first node in a plurality of packets to a second node without knowledge regarding a size of a receiver buffer; b) storing a copy of each transmitted packet at the first node until receiving an acknowledgment from the second node; c) ending transmission of the data packets from the first node to the second node upon filling a sender queue with packets that have not yet been acknowledged; and d) sending a negative acknowledgement to the first node from the second node upon receiving either an out-of-order packet or a packet with an error.